INTERSPEECH.2017 - Analysis and Assessment

Total: 34

#1 Phone Classification Using a Non-Linear Manifold with Broad Phone Class Dependent DNNs

Authors: Linxue Bai ; Peter Jančovič ; Martin Russell ; Philip Weber ; Steve Houghton

Most state-of-the-art automatic speech recognition (ASR) systems use a single deep neural network (DNN) to map the acoustic space to the decision space. However, different phonetic classes employ different production mechanisms and are best described by different types of features. Hence it may be advantageous to replace this single DNN with several phone class dependent DNNs. The appropriate mathematical formalism for this is a manifold. This paper assesses the use of a non-linear manifold structure with multiple DNNs for phone classification. The system has two levels. The first comprises a set of broad phone class (BPC) dependent DNN-based mappings and the second level is a fusion network. Various ways of designing and training the networks in both levels are assessed, including varying the size of hidden layers, the use of the bottleneck or softmax outputs as input to the fusion network, and the use of different broad class definitions. Phone classification experiments are performed on TIMIT. The results show that using the BPC-dependent DNNs provides small but significant improvements in phone classification accuracy relative to a single global DNN. The paper concludes with visualisations of the structures learned by the local and global DNNs and discussion of their interpretations.
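
A minimal sketch of the two-level structure just described, written in PyTorch purely for illustration (it is not the authors' code): several broad-phone-class subnetworks produce bottleneck outputs that are concatenated and passed to a fusion network. The layer sizes, the number of broad classes, and the joint training implied by this sketch are all assumptions of this example.

```python
import torch
import torch.nn as nn

class BPCDNN(nn.Module):
    """One broad-phone-class (BPC) dependent mapping with a bottleneck output."""
    def __init__(self, n_in=440, n_hidden=1024, n_bottleneck=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_bottleneck),   # bottleneck fed to the fusion network
        )

    def forward(self, x):
        return self.net(x)

class FusionNet(nn.Module):
    """Second level: fuses the BPC bottleneck outputs into phone posteriors."""
    def __init__(self, n_bpc=6, n_bottleneck=64, n_phones=48):
        super().__init__()
        self.bpc_nets = nn.ModuleList([BPCDNN(n_bottleneck=n_bottleneck) for _ in range(n_bpc)])
        self.fusion = nn.Sequential(
            nn.Linear(n_bpc * n_bottleneck, 512), nn.ReLU(),
            nn.Linear(512, n_phones),
        )

    def forward(self, x):
        z = torch.cat([net(x) for net in self.bpc_nets], dim=-1)
        return self.fusion(z)                    # phone logits

model = FusionNet()
logits = model(torch.randn(8, 440))              # dummy batch of acoustic frames
posteriors = torch.softmax(logits, dim=-1)
```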

#2 An Investigation of Crowd Speech for Room Occupancy Estimation

Authors: Siyuan Chen ; Julien Epps ; Eliathamby Ambikairajah ; Phu Ngoc Le

Room occupancy estimation technology has been shown to reduce building energy cost significantly. However, speech-based occupancy estimation has not been well explored. In this paper, we investigate energy mode and babble speaker count methods for estimating both small and large crowds in a party-mode room setting. We also examine how the distance between speakers and the microphone affects their estimation accuracies. Then we propose a novel entropy-based method, which is invariant to different speakers and their different positions in a room. Evaluations on synthetic crowd speech generated using the TIMIT corpus show that acoustic volume features are less affected by distance, and our proposed method outperforms existing methods across a range of different conditions.

#3 Time-Frequency Coherence for Periodic-Aperiodic Decomposition of Speech Signals

Authors: Karthika Vijayan ; Jitendra Kumar Dhiman ; Chandra Sekhar Seelamantula

Decomposing speech signals into periodic and aperiodic components is an important task, finding applications in speech synthesis, coding, denoising, etc. In this paper, we construct a time-frequency coherence function to analyze spectro-temporal signatures of speech signals for distinguishing between deterministic and stochastic components of speech. The narrowband speech spectrogram is segmented into patches, which are represented as 2-D cosine carriers modulated in amplitude and frequency. Separation of the carrier and the amplitude/frequency modulations is achieved by 2-D demodulation using the Riesz transform, which is the 2-D extension of the Hilbert transform. The demodulated AM component reflects the contribution of the vocal tract to the spectrogram. The frequency-modulated carrier (FM-carrier) signal exhibits properties of the excitation. The time-frequency coherence is defined with respect to the FM-carrier, and a coherence map is constructed in which highly coherent regions represent nearly periodic and deterministic components of speech, whereas the incoherent regions correspond to unstructured components. The coherence map shows a clear distinction between deterministic and stochastic components in speech characterized by jitter, shimmer, lip radiation, type of excitation, etc. Binary masks prepared from the time-frequency coherence function are used for periodic-aperiodic decomposition of speech. Experimental results are presented to validate the efficiency of the proposed method.
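
For readers unfamiliar with Riesz-transform demodulation, the following numpy sketch shows the core operation on one spectrogram patch: the two Riesz components are obtained with frequency-domain multipliers, and the monogenic amplitude is taken as the AM part. This is a generic monogenic-signal computation under assumed settings (patch size, mean removal), not the authors' implementation.

```python
import numpy as np

def riesz_demodulate(patch):
    """Return the monogenic amplitude (AM) and the two Riesz components of a 2-D patch."""
    patch = patch - patch.mean()                     # demodulation assumes carrier-like content
    rows, cols = patch.shape
    u = np.fft.fftfreq(cols)[None, :]                # horizontal frequency grid
    v = np.fft.fftfreq(rows)[:, None]                # vertical frequency grid
    radius = np.sqrt(u ** 2 + v ** 2)
    radius[0, 0] = 1.0                               # avoid division by zero at DC
    F = np.fft.fft2(patch)
    r1 = np.real(np.fft.ifft2(F * (-1j * u / radius)))   # first Riesz component
    r2 = np.real(np.fft.ifft2(F * (-1j * v / radius)))   # second Riesz component
    am = np.sqrt(patch ** 2 + r1 ** 2 + r2 ** 2)     # amplitude envelope (AM)
    return am, r1, r2

# Example on a stand-in 64x64 patch of a precomputed narrowband spectrogram.
am, r1, r2 = riesz_demodulate(np.random.randn(64, 64))
```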

#4 Musical Speech: A New Methodology for Transcribing Speech Prosody

Authors: Alexsandro R. Meireles ; Antônio R.M. Simões ; Antonio Celso Ribeiro ; Beatriz Raposo de Medeiros

Musical Speech is a new methodology for transcribing speech prosody using musical notation. The methodology presented in this paper is an updated version of our work [12]. Our work is situated in a historical context with a brief survey of the literature on speech melodies, in which we highlight the pioneering works of John Steele, Leoš Janáček, Engelbert Humperdinck, and Arnold Schoenberg, followed by a linguistic view of musical notation in the analysis of speech. Finally, we present the current state of the art of our innovative methodology, which uses a quarter-tone scale for transcribing speech, and show some initial results of applying this methodology to prosodic transcription.

#5 Estimation of Place of Articulation of Fricatives from Spectral Characteristics for Speech Training

Authors: K.S. Nataraj ; Prem C. Pandey ; Hirak Dasgupta

Visual feedback of the place of articulation is considered useful in speech training aids for hearing-impaired children and for learners of second languages, helping them improve pronunciation. For such applications, the relation between the place of articulation of fricatives and their spectral characteristics is investigated using the English fricatives available in the XRMB database, which provides simultaneously acquired speech signals and articulograms. The place of articulation is estimated from the articulogram as the position of maximum constriction in the oral cavity, using an automated graphical technique. The magnitude spectrum is smoothed by critical-band-based median and mean filters to improve the consistency of the spectral parameters. Out of several spectral parameters investigated, the spectral moments and spectral slope appear to be related to the place of articulation of the fricative segment of the utterances as measured from the articulogram. The data are used to train and test a Gaussian mixture model to estimate the place of articulation with the spectral parameters as inputs. The estimated values show a good match with those obtained from the articulograms.
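
The spectral parameters named above (spectral moments and spectral slope) are standard quantities; the short numpy sketch below computes them for a single fricative frame. The windowing, the use of the magnitude (rather than power) spectrum, and the dB-domain slope fit are assumptions of this example, and the critical-band smoothing and GMM stages of the paper are not reproduced.

```python
import numpy as np

def spectral_moments_and_slope(frame, fs):
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    p = spec / np.sum(spec)                                    # spectrum treated as a distribution
    centroid = np.sum(freqs * p)                               # first moment
    spread = np.sqrt(np.sum((freqs - centroid) ** 2 * p))      # second moment
    skewness = np.sum(((freqs - centroid) / spread) ** 3 * p)  # third moment
    kurtosis = np.sum(((freqs - centroid) / spread) ** 4 * p)  # fourth moment
    slope = np.polyfit(freqs, 20 * np.log10(spec + 1e-12), 1)[0]  # dB per Hz
    return centroid, spread, skewness, kurtosis, slope
```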

#6 Estimation of the Probability Distribution of Spectral Fine Structure in the Speech Source

Author: Tom Bäckström

The efficiency of many speech processing methods relies on accurate modeling of the distribution of the signal spectrum, and a majority of prior works suggest that the spectral components follow the Laplace distribution. To improve the probability distribution models based on our knowledge of speech source modeling, we argue that the model should in fact be a multiplicative mixture model, including terms for voiced and unvoiced utterances. While prior works have applied Gaussian mixture models, we demonstrate that a mixture of generalized Gaussian models follows the observations more accurately. The proposed estimation method is based on measuring the ratio of Lp-norms between spectral bands. Such ratios follow the Beta distribution when the input signal is generalized Gaussian, whereby the estimated parameters can be used to determine the underlying parameters of the mixture of generalized Gaussian distributions.
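
As a concrete illustration of the observable described above, the snippet below computes the ratio of Lp norms between two spectral bands of one frame. The band edges and the value of p are placeholders, and the Beta-distribution fit that recovers the generalized Gaussian parameters is not reproduced here.

```python
import numpy as np

def lp_norm_ratio(spectrum, band_a, band_b, p=1.0):
    """Ratio of Lp norms of two bands (given as index ranges) of a spectral frame."""
    a = np.sum(np.abs(spectrum[band_a[0]:band_a[1]]) ** p) ** (1.0 / p)
    b = np.sum(np.abs(spectrum[band_b[0]:band_b[1]]) ** p) ** (1.0 / p)
    return a / b

# Example: ratio between bins 0-127 and 128-255 of a 512-point rFFT frame.
frame = np.random.randn(512)
ratio = lp_norm_ratio(np.fft.rfft(frame), (0, 128), (128, 256), p=1.0)
```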

#7 Low-Dimensional Representation of Spectral Envelope Without Deterioration for Full-Band Speech Analysis/Synthesis System

Authors: Masanori Morise ; Genta Miyashita ; Kenji Ozawa

A speech coding method for a full-band speech analysis/synthesis system is described. In this work, full-band speech is defined as speech with a sampling frequency above 40 kHz, whose Nyquist frequency covers the audible frequency range. Prior work on speech coding has generally focused on narrow-band speech with a sampling frequency below 16 kHz. On the other hand, statistical parametric speech synthesis currently uses full-band speech, and low-dimensional representations of speech parameters are being used. The purpose of this study is to achieve speech coding without deterioration for full-band speech. We focus on a high-quality speech analysis/synthesis system and mel-cepstral analysis using frequency warping. In the frequency warping function, we directly use three auditory scales. We carried out a subjective evaluation using the WORLD vocoder and found that the optimum number of dimensions was around 50. The choice of frequency warping did not significantly affect the sound quality at this number of dimensions.
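
The abstract does not give the exact warping formulas, so the sketch below lists common textbook forms of three auditory scales (mel, Bark, ERB-rate) that such a frequency warping function could use; treat these as assumptions rather than the authors' definitions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)      # O'Shaughnessy mel scale

def hz_to_bark(f):
    return 26.81 * f / (1960.0 + f) - 0.53         # Traunmüller Bark scale

def hz_to_erb_rate(f):
    return 21.4 * np.log10(1.0 + 0.00437 * f)      # Glasberg & Moore ERB-rate scale
```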

#8 Robust Source-Filter Separation of Speech Signal in the Phase Domain

Authors: Erfan Loweimi ; Jon Barker ; Oscar Saz Torralba ; Thomas Hain

In earlier work we proposed a framework for speech source-filter separation that employs phase-based signal processing. This paper presents a further theoretical investigation of the model and optimisations that make the filter and source representations less sensitive to the effects of noise and better matched to downstream processing. To this end, first, in computing the Hilbert transform, the log function is replaced by the generalised logarithmic function. This introduces a tuning parameter that adjusts both the dynamic range and distribution of the phase-based representation. Second, when computing the group delay, a more robust estimate for the derivative is formed by applying a regression filter instead of using sample differences. The effectiveness of these modifications is evaluated in clean and noisy conditions by considering the accuracy of the fundamental frequency extracted from the estimated source, and the performance of speech recognition features extracted from the estimated filter. In particular, the proposed filter-based front-end reduces Aurora-2 WERs by 6.3% (average 0–20 dB) compared with previously reported results. Furthermore, when tested on an LVCSR task (Aurora-4), the new features resulted in a 5.8% absolute WER reduction compared to MFCCs without performance loss in the clean/matched condition.
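
The two modifications described above can be sketched as follows: a generalised logarithm with a tuning parameter gamma, and a regression-based estimate of the phase derivative, here realised with a Savitzky-Golay first-derivative filter as one example of a regression filter. The window length, polynomial order and gamma value are guesses, not the paper's settings.

```python
import numpy as np
from scipy.signal import savgol_filter

def generalized_log(x, gamma=0.1):
    """Generalised logarithm; approaches log(x) as gamma -> 0."""
    return (x ** gamma - 1.0) / gamma

def group_delay(phase, window=9, polyorder=2):
    """Group delay as the negative regression-smoothed derivative of unwrapped phase
    (derivative taken with respect to the frequency-bin index)."""
    return -savgol_filter(np.unwrap(phase), window_length=window,
                          polyorder=polyorder, deriv=1)
```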

#9 A Time-Warping Pitch Tracking Algorithm Considering Fast f0 Changes

Authors: Simon Stone ; Peter Steiner ; Peter Birkholz

Accurately tracking the fundamental frequency (f0), or pitch, in speech data is of great interest in numerous contexts. All currently available pitch tracking algorithms perform a short-term analysis of a speech signal to extract the f0 under the assumption that the pitch does not change within a single analysis frame, a simplification that introduces errors when the f0 changes rather quickly over time. This paper proposes a new algorithm that warps the time axis of an analysis frame to counteract intra-frame f0 changes and thus improve the overall tracking results. The algorithm was evaluated on a set of 4718 sentences from 20 speakers (10 male, 10 female), with added white and babble noise. It was comparable in performance to the state-of-the-art algorithms RAPT and Praat's To Pitch (ac) under clean conditions and outperformed both of them under noisy conditions.

#10 A Modulation Property of Time-Frequency Derivatives of Filtered Phase and its Application to Aperiodicity and fo Estimation

Authors: Hideki Kawahara ; Ken-Ichi Sakakibara ; Masanori Morise ; Hideki Banno ; Tomoki Toda

We introduce a simple and linear SNR (strictly speaking, periodic-to-random power ratio) estimator (0 dB to 80 dB without additional calibration/linearization) for providing reliable descriptions of aperiodicity in a speech corpus. The main idea of this method is to estimate the background random noise level without directly extracting the background noise. The proposed method is applicable to a wide variety of time windowing functions with very low sidelobe levels. The estimate combines the frequency derivative and the time-frequency derivative of the mapping from filter center frequency to the output instantaneous frequency. This procedure can replace the periodicity detection and aperiodicity estimation subsystems of the recently introduced open-source vocoder, YANG vocoder. Source code of a MATLAB implementation of this method will also be open-sourced.

#11 Non-Local Estimation of Speech Signal for Vowel Onset Point Detection in Varied Environments

Authors: Avinash Kumar ; S. Shahnawazuddin ; Gayadhar Pradhan

The vowel onset point (VOP) is an important piece of information extensively employed in speech analysis and synthesis. Detecting the VOPs in a given speech sequence, independent of the text context and recording environment, is a challenging area of research. The performance of existing VOP detection methods has not yet been extensively studied in varied environmental conditions. In this paper, we exploit non-local means estimation to detect those regions in the speech sequence which have a high signal-to-noise ratio and exhibit periodicity. Mostly, those regions happen to be the vowel regions. This helps in overcoming the ill effects of environmental degradations. Next, for each short-time frame of the estimated speech sequence, we cumulatively sum the magnitude of the corresponding Fourier transform spectrum. The cumulative sum is then used as the feature to detect the VOPs. Experiments conducted on the TIMIT database show that the proposed approach provides better results in terms of detection and spurious rates when compared to a few existing methods under clean and noisy test conditions.
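
The last step of the method lends itself to a short sketch: frame the (estimated) speech signal, sum the magnitude spectrum of each frame, and flag sharp rises in that contour as VOP candidates. The non-local means estimation stage is omitted, and the frame length, hop and peak-picking rule below are placeholder choices, not the paper's settings.

```python
import numpy as np

def spectral_sum_contour(x, fs, frame_ms=20, hop_ms=10):
    frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - frame) // hop
    contour = np.empty(n_frames)
    for i in range(n_frames):
        seg = x[i * hop:i * hop + frame] * np.hamming(frame)
        contour[i] = np.sum(np.abs(np.fft.rfft(seg)))    # summed spectral magnitude per frame
    return contour

def vop_candidates(contour, threshold=None):
    rise = np.diff(contour, prepend=contour[0])          # frame-to-frame increase
    if threshold is None:
        threshold = 3.0 * np.std(rise)                   # ad hoc threshold
    return np.where(rise > threshold)[0]                 # candidate frame indices
```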

#12 Time-Domain Envelope Modulating the Noise Component of Excitation in a Continuous Residual-Based Vocoder for Statistical Parametric Speech Synthesis

Authors: Mohammed Salah Al-Radhi ; Tamás Gábor Csapó ; Géza Németh

In this paper, we present an extension of a novel continuous residual-based vocoder for statistical parametric speech synthesis. Previous work has shown the advantages of adding envelope-modulated noise to the voiced excitation, but this has not yet been investigated in the context of continuous vocoders, i.e., vocoders in which all parameters are continuous. The noise component is often not accurately modeled in modern vocoders (e.g. STRAIGHT). For more natural-sounding speech synthesis, four time-domain envelopes (Amplitude, Hilbert, Triangular and True) are investigated and enhanced, and then applied to the noise component of the excitation in our continuous vocoder. The performance evaluation is based on the study of these time envelopes. In an objective experiment, we investigated the Phase Distortion Deviation of vocoded samples. A MUSHRA-type subjective listening test was also conducted comparing natural and vocoded speech samples. Both experiments show that the proposed framework using the Hilbert and True envelopes provides high-quality vocoding while outperforming the two other envelopes.
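
To make the idea of envelope-modulated noise concrete, the sketch below applies one of the four envelopes named above, the Hilbert envelope, to a white-noise excitation component. It illustrates the general operation only; the authors' continuous vocoder and the other three envelopes are not reproduced here.

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope_modulated_noise(voiced_excitation, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    envelope = np.abs(hilbert(voiced_excitation))         # time-domain Hilbert envelope
    noise = rng.standard_normal(len(voiced_excitation))
    return envelope * noise                               # envelope-shaped noise component
```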

#13 Wavelet Speech Enhancement Based on Robust Principal Component Analysis

Authors: Chia-Lung Wu ; Hsiang-Ping Hsu ; Syu-Siang Wang ; Jeih-Weih Hung ; Ying-Hui Lai ; Hsin-Min Wang ; Yu Tsao

Most state-of-the-art speech enhancement (SE) techniques prefer to enhance utterances in the frequency domain rather than in the time domain. However, the overlap-add (OLA) operation in the short-time Fourier transform (STFT) for speech signal processing possibly distorts the signal and limits the performance of the SE techniques. In this study, a novel SE method that integrates the discrete wavelet packet transform (DWPT) and a novel subspace-based method, robust principal component analysis (RPCA), is proposed to enhance noise-corrupted signals directly in the time domain. We evaluate the proposed SE method on the Mandarin hearing in noise test (MHINT) sentences. The experimental results show that the new method reduces the signal distortions dramatically, thereby improving speech quality and intelligibility significantly. In addition, the newly proposed method outperforms the STFT-RPCA-based speech enhancement system.

#14 Vowel Onset Point Detection Using Sonority Information

Authors: Bidisha Sharma ; S.R. Mahadeva Prasanna

The vowel onset point (VOP) refers to the starting event of a vowel, which may be reflected in different aspects of the speech signal. The major issue in VOP detection using existing methods is the confusion between vowels and other categories of sounds preceding them. This work explores the usefulness of sonority information to reduce this confusion and improve VOP detection. Vowels are the most sonorant sounds, followed by semivowels, nasals, voiced fricatives, and voiced stops. The sonority feature is derived from the vocal-tract system, the excitation source and suprasegmental aspects. As this feature has the capability to discriminate among different sonorant sound units, it reduces the confusion between the onsets of vowels and those of other sonorant sounds. This results in improved detection accuracy and temporal resolution of VOP detection for continuous speech. The performance of the proposed sonority-based VOP detection is found to be 92.4%, compared to 85.2% for the existing method. Also, the resolution of localizing the VOP within 10 ms is significantly enhanced, and a performance of 73.0% is achieved as opposed to 60.2% by the existing method.

#15 Analytic Filter Bank for Speech Analysis, Feature Extraction and Perceptual Studies

Author: Unto K. Laine

A speech signal consists of events in time and frequency, and therefore its analysis with high-resolution time-frequency tools is often of importance. An analytic filter bank provides a simple, fast, and flexible method to construct time-frequency representations of signals. Its parameters can be easily adapted to different situations, from a uniform to any auditory frequency scale, or even to a focused resolution. Since the Hilbert magnitude values of the channels are obtained at every sample, it provides a practical tool for high-resolution time-frequency analysis. The present study describes the basic theory of analytic filters and tests their main properties. Applications of the analytic filter bank to different speech analysis tasks, including pitch period estimation and pitch-synchronous analysis of formant frequencies and bandwidths, are demonstrated. In addition, a new feature vector called the group delay vector is introduced. It is shown that this representation provides comparable, or even better, results than those obtained with spectral magnitude feature vectors in the analysis and classification of vowels. The implications of this observation are also discussed from the speech perception point of view.
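
A minimal analytic filter bank can be sketched as below: each channel is a complex band-pass filter built by modulating a real low-pass FIR prototype with a complex exponential, so the channel's Hilbert magnitude is available at every sample. Centre frequencies, bandwidth and filter length are illustrative assumptions, not the parameters used in the paper.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def analytic_filter_bank(x, fs, centre_freqs, bandwidth=100.0, numtaps=512):
    proto = firwin(numtaps, bandwidth / 2.0, fs=fs)        # real low-pass prototype
    n = np.arange(numtaps)
    envelopes = []
    for fc in centre_freqs:
        h = proto * np.exp(2j * np.pi * fc * n / fs)       # shift prototype to centre frequency fc
        y = lfilter(h, 1.0, x)                             # complex (analytic) channel output
        envelopes.append(np.abs(y))                        # Hilbert magnitude at every sample
    return np.array(envelopes)

# Example: 40 channels spaced 100 Hz apart on a 16 kHz signal.
# env = analytic_filter_bank(signal, 16000, np.arange(100, 4100, 100))
```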

#16 Learning the Mapping Function from Voltage Amplitudes to Sensor Positions in 3D-EMA Using Deep Neural Networks

Authors: Christian Kroos ; Mark D. Plumbley

The first generation of three-dimensional Electromagnetic Articulography devices (Carstens AG500) suffered from occasional critical tracking failures. Although now superseded by new devices, the AG500 is still in use in many speech labs and many valuable data sets exist. In this study we investigate whether deep neural networks (DNNs) can learn the mapping function from raw voltage amplitudes to sensor positions based on a comprehensive movement data set. This is compared to arriving sample by sample at individual position values via direct optimisation as used in previous methods. We found that with appropriate hyperparameter settings a DNN was able to approximate the mapping function with good accuracy, leading to a smaller error than the previous methods, but that the DNN-based approach was not able to solve the tracking problem completely.

#17 The I4U Mega Fusion and Collaboration for NIST Speaker Recognition Evaluation 2016

Authors: Kong Aik Lee ; SRE’16 I4U Group

The 2016 speaker recognition evaluation (SRE’16) is the latest edition in the series of benchmarking events conducted by the National Institute of Standards and Technology (NIST). I4U is a joint entry to SRE’16 resulting from the collaboration and active exchange of information among researchers from sixteen institutes and universities across four continents. The joint submission and several of its 32 sub-systems were among the top-performing systems. A lot of effort has been devoted to two major challenges, namely, unlabeled training data and the dataset shift from Switchboard-Mixer to the new Call My Net dataset. This paper summarizes the lessons learned and presents our shared view from the sixteen research groups on recent advances, major paradigm shifts, and the common tool chain used in speaker recognition as we have witnessed in SRE’16. More importantly, we look into the intriguing question of fusing a large ensemble of sub-systems and the potential benefit of large-scale collaboration.

#18 The MIT-LL, JHU and LRDE NIST 2016 Speaker Recognition Evaluation System

Authors: Pedro A. Torres-Carrasquillo ; Fred Richardson ; Shahan Nercessian ; Douglas Sturim ; William Campbell ; Youngjune Gwon ; Swaroop Vattam ; Najim Dehak ; Harish Mallidi ; Phani Sankar Nidadavolu ; Ruizhi Li ; Reda Dehak

In this paper, the NIST 2016 SRE system that resulted from the collaboration between MIT Lincoln Laboratory and the team at Johns Hopkins University is presented. The submissions for the 2016 evaluation consisted of three fixed-condition submissions and a single open-condition submission. The primary submission on the fixed (and core) condition resulted in an actual DCF of 0.618. Details of the submissions are presented, along with discussion and observations from the 2016 evaluation campaign.

#19 Nuance - Politecnico di Torino’s 2016 NIST Speaker Recognition Evaluation System

Authors: Daniele Colibro ; Claudio Vair ; Emanuele Dalmasso ; Kevin Farrell ; Gennady Karvitsky ; Sandro Cumani ; Pietro Laface

This paper describes the Nuance–Politecnico di Torino (NPT) speaker recognition system submitted to the NIST SRE16 evaluation campaign. Included are the results of post-evaluation tests, focusing on the analysis of the performance of generative and discriminative classifiers, and of score normalization. The submitted system combines the results of four GMM-IVector models, two DNN-IVector models and a GMM-SVM acoustic system. Each system exploits acoustic front-end parameters that differ by feature type and dimension. We analyze the main components of our submission, which contributed to obtaining 8.1% EER and 0.532 actual Cprimary in the challenging SRE16 Fixed condition.

#20 UTD-CRSS Systems for 2016 NIST Speaker Recognition Evaluation

Authors: Chunlei Zhang ; Fahimeh Bahmaninezhad ; Shivesh Ranjan ; Chengzhu Yu ; Navid Shokouhi ; John H.L. Hansen

This study describes the systems submitted by the Center for Robust Speech Systems (CRSS) from the University of Texas at Dallas (UTD) to the 2016 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE). We developed four UBM and DNN i-vector based speaker recognition systems with alternate data sets and feature representations. Given that the emphasis of the NIST SRE 2016 is on language mismatch between training and enrollment/test data, so-called domain mismatch, in our system development we focused on: (i) utilizing unlabeled in-domain data for centralizing i-vectors to alleviate the domain mismatch; (ii) selecting the proper data sets and optimizing configurations for training LDA/PLDA; (iii) introducing a newly proposed dimension reduction technique which incorporates unlabeled in-domain data before PLDA training; (iv) unsupervised speaker clustering of unlabeled data and using them alone or with previous SREs for PLDA training, and finally (v) score calibration using unlabeled data with “pseudo” speaker labels generated from speaker clustering. NIST evaluations show that our proposed methods were very successful for the given task.

#21 Analysis and Description of ABC Submission to NIST SRE 2016

Authors: Oldřich Plchot ; Pavel Matějka ; Anna Silnova ; Ondřej Novotný ; Mireia Diez Sánchez ; Johan Rohdin ; Ondřej Glembek ; Niko Brümmer ; Albert Swart ; Jesús Jorrín-Prieto ; Paola García ; Luis Buera ; Patrick Kenny ; Jahangir Alam ; Gautam Bhattacharya

We present a condensed description and analysis of the joint submission for NIST SRE 2016, by Agnitio, BUT and CRIM (ABC). We concentrate on challenges that arose during development and we analyze the results obtained on the evaluation data and on our development sets. We show that testing on mismatched, non-English and short duration data introduced in NIST SRE 2016 is a difficult problem for current state-of-the-art systems. Testing on this data brought back the issue of score normalization and it also revealed that the bottleneck features (BN), which are superior when used for telephone English, are lacking in performance against the standard acoustic features like Mel Frequency Cepstral Coefficients (MFCCs). We offer ABC’s insights, findings and suggestions for building a robust system suitable for mismatched, non-English and relatively noisy data such as those in NIST SRE 2016.

#22 The 2016 NIST Speaker Recognition Evaluation

Authors: Seyed Omid Sadjadi ; Timothée Kheyrkhah ; Audrey Tong ; Craig Greenberg ; Douglas Reynolds ; Elliot Singer ; Lisa Mason ; Jaime Hernandez-Cordero

In 2016, the National Institute of Standards and Technology (NIST) conducted the most recent in an ongoing series of speaker recognition evaluations (SRE) to foster research in robust text-independent speaker recognition, as well as measure performance of current state-of-the-art systems. Compared to previous NIST SREs, SRE16 introduced several new aspects including: an entirely online evaluation platform, a fixed training data condition, more variability in test segment duration (uniformly distributed between 10s and 60s), the use of non-English (Cantonese, Cebuano, Mandarin and Tagalog) conversational telephone speech (CTS) collected outside North America, and providing labeled and unlabeled development (a.k.a. validation) sets for system hyperparameter tuning and adaptation. The introduction of the new non-English CTS data made SRE16 more challenging due to domain/channel and language mismatches as compared to previous SREs. A total of 66 research organizations from industry and academia registered for SRE16, out of which 43 teams submitted 121 valid system outputs that produced scores. This paper presents an overview of the evaluation and analysis of system performance over all primary evaluation conditions. Initial results indicate that effective use of the development data was essential for the top performing systems, and that domain/channel, language, and duration mismatch had an adverse impact on system performance.

#23 A Robust and Alternative Approach to Zero Frequency Filtering Method for Epoch Extraction

Authors: P. Gangamohan ; B. Yegnanarayana

During the production of voiced speech, there exist impulse-like excitations due to abrupt closures of the vocal folds. These impulse-like excitations are often referred to as epochs or glottal closure instants (GCIs). The zero frequency filtering (ZFF) method exploits the properties of impulse-like excitation by passing the speech signal through a resonator whose pole pair is located at 0 Hz. As the resonator is unstable, polynomial growth/decay is observed in the filtered signal, thus requiring a trend removal operation. It is observed that the length of the window for the trend removal operation is critical for speech signals with large fluctuations in the fundamental frequency (F0). In this paper, a simple finite impulse response (FIR) implementation is proposed. The FIR filter is designed by placing a large number of zeros at fs/2 Hz (fs is the sampling frequency), close to the unit circle in the z-plane. Experimental results show that the proposed method is robust and computationally less complex than the ZFF method.
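
A rough sketch of the filter construction described above, with parameter values that are guesses: stacking many zeros at z = -r (i.e., at fs/2, close to the unit circle) yields a stable FIR approximation of a filter concentrated at 0 Hz, and epoch candidates can then be read from the positive-going zero crossings of the filtered signal, following the usual ZFF convention. The paper's exact post-processing may differ.

```python
import numpy as np
from scipy.signal import lfilter

def fir_zero_frequency_filter(x, n_zeros=128, r=0.999):
    b = np.poly(-r * np.ones(n_zeros))     # coefficients of (1 + r z^-1)^n_zeros
    b = b / np.sum(b)                      # normalise the DC gain to 1
    return lfilter(b, [1.0], x)

def epoch_candidates(y):
    """Negative-to-positive zero crossings of the filtered signal (ZFF convention)."""
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0]
```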

#24 Improving YANGsaf F0 Estimator with Adaptive Kalman Filter

Author: Kanru Hua

We present improvements to the refinement stage of YANGsaf [1] (Yet ANother Glottal source analysis framework), a recently published F0 estimation algorithm by Kawahara et al., for noisy/breathy speech signals. The baseline system, based on time warping and weighted averaging of multi-band instantaneous frequency estimates, is still sensitive to additive noise when none of the harmonics provides a reliable frequency estimate at low SNR. We alleviate this problem by calibrating the weighted averaging process based on statistics gathered from a Monte-Carlo simulation, and by applying Kalman filtering to refine the F0 trajectory with time-varying measurement and process distributions. The improved algorithm, adYANGsaf (adaptive Yet ANother Glottal source analysis framework), achieves significantly higher accuracy and a smoother F0 trajectory on noisy speech while retaining its accuracy on clean speech, with little added computational overhead.
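
The Kalman-filtering step described above can be illustrated with a generic scalar filter over an F0 track, where each frame carries its own measurement variance (e.g., larger in noisy or breathy frames). This is a textbook random-walk Kalman filter under assumed settings, not the adYANGsaf implementation.

```python
import numpy as np

def kalman_filter_f0(f0_obs, obs_var, process_var=1.0):
    """Filter a per-frame F0 observation sequence with per-frame measurement variances."""
    f0_obs = np.asarray(f0_obs, dtype=float)
    obs_var = np.asarray(obs_var, dtype=float)
    est = np.empty_like(f0_obs)
    x, p = f0_obs[0], obs_var[0]           # initialise state with the first observation
    est[0] = x
    for t in range(1, len(f0_obs)):
        p = p + process_var                # predict: random-walk state model
        k = p / (p + obs_var[t])           # Kalman gain (time-varying measurement noise)
        x = x + k * (f0_obs[t] - x)        # update with the frame-t measurement
        p = (1.0 - k) * p
        est[t] = x
    return est
```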

#25 A Spectro-Temporal Demodulation Technique for Pitch Estimation

Authors: Jitendra Kumar Dhiman ; Nagaraj Adiga ; Chandra Sekhar Seelamantula

We consider a two-dimensional demodulation framework for spectro-temporal analysis of the speech signal. We construct narrowband (NB) speech spectrograms, and demodulate them using the Riesz transform, which is a two-dimensional extension of the Hilbert transform. The demodulation results in time-frequency envelope (amplitude modulation or AM) and time-frequency carrier (frequency modulation or FM). The AM corresponds to the vocal tract and is referred to as the vocal tract spectrogram. The FM corresponds to the underlying excitation and is referred to as the carrier spectrogram. The carrier spectrogram exhibits a high degree of time-frequency consistency for voiced sounds. For unvoiced sounds, such a structure is lacking. In addition, the carrier spectrogram reflects the fundamental frequency (F0) variation of the speech signal. We develop a technique to determine the F0 from the carrier spectrogram. The time-frequency consistency is used to determine which time-frequency regions correspond to voiced segments. Comparisons with the state-of-the-art F0 estimation algorithms show that the proposed F0 estimator has high accuracy for telephone channel speech and is robust to noise.